In this tutorial we first briefly introduce the Bayesian paradigm for statistical inference and the concept of a hierarchical Bayesian graphical model. We simulate some toy problems with R and illustrate how we can easily represent them and infer their parameters of interest by means of the Stan programming language and its R interface RStan. The second part of the tutorial is devoted to using the Bayesian hierarchical methodology to derive a set of Period-Luminosity-Metallicity relations from Gaia parallaxes and photometry in several bands.

1 Bayesian Graphical Models

1.1 The Bayesian paradigm

Bayesian inference is based on Bayes' rule:

\[ p\left(\Theta\mid\mathcal{D}\right)=\frac{p\left(\mathcal{D}\mid\Theta\right)\times p\left(\Theta\right)}{p\left(\mathcal{D}\right)}\,, \] where \(\mathcal{D}\) denotes the observations (data), \(\Theta\) the parameters of a model proposed to explain the data, and the \(p\)'s represent (discrete or continuous) probability distributions. Given that the probability of the data is constant, we can drop it and rewrite the former expression as

\[ p\left(\Theta\mid\mathcal{D}\right)\propto p\left(\mathcal{D}\mid\Theta\right)\times p\left(\Theta\right)\,, \] where the right side represents the model, specified by the joint probability distribution \(p\left(\mathcal{D},\Theta\right)\) of the data and the parameters, which factorizes into:

  • The conditional distribution \(p\left(\mathcal{D}\mid\Theta\right)\) of the data given the parameters, the so-called likelihood, which is viewed as a function of the parameters.
  • The prior distribution \(p\left(\Theta\right)\) of the parameters, which represents our knowledge about them before observing the data.

The left side \(p\left(\Theta\mid\mathcal{D}\right)\) represents what we infer from the model: the joint posterior distribution of the parameters given the data. Computing this distribution requires the normalizing constant \(p\left(\mathcal{D}\right)\), which in many cases is analytically intractable. Furthermore, if we are only interested in a subset \(\Phi\) of \(\Theta=\left(\Phi,\Lambda\right)\), we must integrate (marginalize) the joint posterior distribution over the set \(\Lambda\) of nuisance parameters, which can again be challenging. Fortunately, both obstacles can be circumvented by sampling from the posterior distribution by means of Markov chain Monte Carlo (MCMC) simulation techniques.
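As a minimal illustration of these ideas, we can approximate a one-parameter posterior on a grid in R; here the normalizing constant \(p\left(\mathcal{D}\right)\) is just the sum of likelihood times prior over the grid (a sketch, not part of the models below):

```r
# Grid approximation of the posterior of a normal mean with known sigma = 1
set.seed(1)
x <- rnorm(20, mean = 2, sd = 1)           # simulated observations
mu.grid <- seq(-5, 5, length.out = 1001)   # grid over the parameter mu

# likelihood x prior on the grid (computed on the log scale for stability)
log.post <- sapply(mu.grid, function(m) sum(dnorm(x, m, 1, log = TRUE))) +
            dnorm(mu.grid, 0, 10, log = TRUE)
unnorm <- exp(log.post - max(log.post))

# dividing by p(D) (here, the sum over the grid) normalizes the posterior
posterior <- unnorm / sum(unnorm)

sum(posterior)            # 1: a proper distribution
sum(mu.grid * posterior)  # posterior mean, close to mean(x) under this weak prior
```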

1.2 A toy problem, DAG models and Stan

Let us now consider a very simple problem: to infer the mean \(\mu\) of a \(\mathsf{N}\left(\mu=3,\sigma=0.5\right)\) population, assuming that we know its standard deviation \(\sigma=0.5\). We first simulate some data:

N <- 3000
x.mu <- 3
x.sigma <- 0.5
x.samples <- rnorm(N, x.mu, x.sigma)
hist(x.samples,breaks=100,xlab="x",main="",freq=FALSE)

A model for this problem is: \[ p\left(\mathcal{D},\mu\right)=\prod_{i=1}^{N}p\left(\hat{x}_{i}\mid\mu\right)\times p\left(\mu\right) \] where \(\hat{x}_{i}\mid\mu\sim\mathsf{N}\left(\mu,\sigma=0.5\right)\), \(\mu\sim\mathsf{N}\left(0,\sigma_{0}\right)\), with \(\sim\) meaning distributed as, and \(\sigma_0\) a hyperparameter that quantifies the uncertainty of our prior knowledge about the true value of the mean, which we centre at 0.
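This toy model is in fact conjugate, so the exact posterior is available in closed form and can later serve as a check on the MCMC output. A sketch using the standard normal-normal updating formulas, with the same \(\sigma_0=100\) used for the fit below:

```r
# Exact posterior for the toy model: N(0, sigma0) prior, known sigma
# Precisions add: 1/post.sd^2 = 1/sigma0^2 + N/sigma^2
set.seed(42)
N <- 3000; sigma <- 0.5; sigma0 <- 100
x <- rnorm(N, 3, sigma)

post.prec <- 1 / sigma0^2 + N / sigma^2
post.mean <- (sum(x) / sigma^2) / post.prec
post.sd   <- sqrt(1 / post.prec)

# with so vague a prior, post.mean ~ mean(x) and post.sd ~ sigma/sqrt(N)
c(post.mean, post.sd)
```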

For a better understanding of the dependency structure between variables, it is customary to represent the model using the Bayesian network formalism: we draw a DAG (directed acyclic graph) whose topology encodes the factorization of the joint probability distribution \(p\left(\mathcal{D},\Theta\right)\). Each node of the DAG is assigned one factor of \(p\left(\mathcal{D},\Theta\right)\). If the factor is a conditional distribution, the parents of the node are the variables it is conditioned on; if the factor is an unconditional (marginal) distribution, the node has no parents.

Now, to declare and draw the DAG corresponding to our toy model we use the dot language and call the grViz function of the DiagrammeR package.

library(DiagrammeR)
grViz("
  digraph BGMexample1 {
    margin=0;
    ratio=0.6; 
    compound=true; 
    node[style=filled,shape=circle,color=black, fillcolor=cadetblue3,fixedsize=false];
    rankdir =TB;
    
    subgraph cluster0 {
      
      label='N';
      # Observations
      # https://github.com/rich-iannone/DiagrammeR/issues/71
      x_rec[label='x̂@_{i}',shape=doublecircle]; 
    }
    # Parameters
    mu[label='μ'];
    sigma[label='σ',shape=square];
   # Links to observed nodes
    mu -> x_rec;
    sigma -> x_rec;
  }
 ", width = 400, height = 400  )

To perform the Bayesian analysis we need a probabilistic modelling language in which to declare our model, and an MCMC sampler to perform the inference. In this tutorial we use the Stan language and its default sampler, NUTS (the No-U-Turn Sampler). To run Stan from R we load the interface rstan:

library(rstan)
library(ggplot2)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

We define now the Stan model for our toy problem:

stanBGM1 <-'
  data {                     
    int<lower=0> N;            
    vector[N] obsx;           
    real<lower=0> sigma;
    real<lower=0> sigma0;
  }
  parameters {                
    real mu;
  }
  model {                    
    //prior
    mu ~ normal(0, sigma0);             
 
    //likelihood
    for (n in 1:N) {
      obsx[n] ~ normal(mu, sigma);
    }
  }
'

Now, we load our data into a list, call the sampler and store the results of the fit:

# Load data 
dat = list(N = N, obsx=x.samples, sigma=x.sigma, sigma0=100)

set.seed(12354)


fit1 <- stan(model_code = stanBGM1, data = dat, chains = 3,  iter = 6000  , warmup = 3000, verbose = FALSE)

Finally, we print the results and plot the Markov chains and a histogram corresponding to the posterior samples of \(\mu\):

# summary
print(fit1, digits_summary=4)
Inference for Stan model: b8858ef90e762465aa83bc5f8a8e20d3.
3 chains, each with iter=6000; warmup=3000; thin=1; 
post-warmup draws per chain=3000, total post-warmup draws=9000.

          mean se_mean     sd      2.5%       25%       50%       75%     97.5% n_eff   Rhat
mu      3.0346  0.0004 0.0225    2.9899    3.0196    3.0349    3.0500    3.0776  3002 1.0018
lp__ -228.4948  0.0118 0.7181 -230.4073 -228.6794 -228.2175 -228.0369 -227.9874  3712 1.0005

Samples were drawn using NUTS(diag_e) at Mon Nov 13 20:00:04 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
stan_trace(fit1,pars=c("mu"),unconstrain = FALSE)

stan_hist(fit1,pars=c("mu"),unconstrain = FALSE)

1.3 Hierarchical Modelling

A hierarchical Bayesian model (HBM) divides the variability of a statistical problem into several levels.

Consider the following problem: to infer the mean \(\mu\) and the standard deviation \(\sigma\) of a Gaussian population from \(N\) observed values with their corresponding uncertainties. First, let us simulate some data:

N <- 500
x.mu <- 3
x.sigma <- 0.5
x.samples <- rnorm(N, x.mu , x.sigma)
x.obs.errors <- rnorm(N,0,1) 
x.obs.errors <- abs(x.obs.errors)
x.obs.samples <- rnorm(N, x.samples, x.obs.errors)  
hist(x.samples,breaks=100,xlab="x",main="",freq=FALSE,xlim=c(0,6),ylim=c(0,1))

hist(x.obs.samples,breaks=100,xlab="x",main="",freq=FALSE,xlim=c(0,6),ylim=c(0,1))

A suitable model for this problem is:

\[ p\left(\mathcal{D},\Theta\right)=p\left(\mu\right)\cdot p\left(\sigma\right)\prod_{i=1}^{N}p\left(\hat{x}_{i}\mid x_{i}\right)\cdot p\left(x_{i}\mid\mu,\sigma\right) \]

where \(\hat{x}_{i}\mid x_{i}\sim\mathsf{N}\left(x_{i},\sigma_{x_{i}}\right)\), \(x_{i}\mid\mu,\sigma\sim\mathsf{N}\left(\mu,\sigma\right)\), \(\mu\sim\mathsf{N}\left(\mu_{0},\sigma_{0}\right)\) and \(\sigma\sim\mathsf{U}\left(0,k\right)\). That is, we distinguish between the intrinsic variability of the statistical problem (the variability at a population level) and the variability of each observation.
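Since every distribution in this model is Gaussian, each latent \(x_i\) can be integrated out analytically, leaving \(\hat{x}_{i}\sim\mathsf{N}\left(\mu,\sqrt{\sigma^{2}+\sigma_{x_{i}}^{2}}\right)\). Maximizing this marginal likelihood gives a fast, non-MCMC cross-check of \(\left(\mu,\sigma\right)\); a sketch on data simulated as above:

```r
# Non-MCMC cross-check: with Gaussians throughout, each latent x_i can be
# integrated out, leaving obs_i ~ N(mu, sqrt(sigma^2 + e_i^2))
set.seed(7)
N <- 500
x.true <- rnorm(N, 3, 0.5)
e <- abs(rnorm(N, 0, 1))     # per-observation uncertainties
x.obs <- rnorm(N, x.true, e)

# negative marginal log-likelihood in (mu, sigma)
nll <- function(p) -sum(dnorm(x.obs, p[1], sqrt(p[2]^2 + e^2), log = TRUE))
mle <- optim(c(0, 1), nll)$par
mle  # close to the true values (mu, sigma) = (3, 0.5)
```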

Now, we declare and draw the DAG corresponding to our HBM:

grViz("
  digraph BGMexample2 {
    margin=0;
    ratio=0.6; 
    compound=true; 
    node[style=filled,shape=circle,color=black, fillcolor=gray,fixedsize=false];
    rankdir =TB;
    
    subgraph cluster0 {
      
      label='N';
      
      # Parameters
      x[label='x@_{i}'];
      # Observations
      # https://github.com/rich-iannone/DiagrammeR/issues/71
      x_rec[label='x&#770;@_{i}',shape=doublecircle,fillcolor=deepskyblue1]; 
    }
    # Hyperparameters
    mu[label='&mu;',fillcolor=darkseagreen3];
    sigma[label='&sigma;',fillcolor=darkseagreen3];
   # Links 
    mu -> x;
    sigma -> x;
    x -> x_rec;
  }
 ", width = 600, height = 600  )

We define now the Stan model for our second problem:

stanBGM2 <-'
  data {                     
    int<lower=0> N;            
    vector[N] obsx;           
    vector<lower=0>[N] sigmaobsx;       
  }
  parameters {                
    real mu;
    real<lower=0> sigma;
    vector[N] x;   
  }
  model {                    
    // hyperpriors
    mu ~ normal(0, 10); 
    sigma ~ uniform(0, 10);    

    // priors
    for (n in 1:N) {
      x[n] ~ normal(mu, sigma);
      // likelihood  
      obsx[n] ~ normal(x[n], sigmaobsx[n]);
    }
  }
'

Finally, we call the sampler and print summary statistics of the posterior samples for the parameters \(\mu\) and \(\sigma\).

# Create the data list object
dat = list(N = N, obsx=x.obs.samples, sigmaobsx=x.obs.errors)


# Compile the model (a single iteration just builds the executable)
fit1 <- stan(model_code = stanBGM2, chains = 1, data = dat, iter = 1, verbose = FALSE)

set.seed(12354)

fit2 = stan(fit=fit1 , data = dat,  iter = 5000  , warmup = 3000, chains = 3)
# summary
print(fit2, pars=c("mu","sigma"), digits_summary=4)
Inference for Stan model: 59c7a279bc3a48dfa2b40aec2351c5fc.
3 chains, each with iter=5000; warmup=3000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=6000.

        mean se_mean     sd   2.5%    25%    50%    75%  97.5% n_eff   Rhat
mu    3.0301   5e-04 0.0355 2.9606 3.0058 3.0307 3.0540 3.0994  4170 1.0004
sigma 0.5194   8e-04 0.0320 0.4595 0.4978 0.5182 0.5404 0.5858  1467 1.0027

Samples were drawn using NUTS(diag_e) at Mon Nov 13 19:53:02 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
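Conditional on \(\mu\) and \(\sigma\), the posterior mean of each latent \(x_i\) is a precision-weighted average of its observation and the population mean; this is the origin of the shrinkage discussed below. A minimal sketch (the helper shrink is ours, not part of the model):

```r
# Conditional posterior mean of x_i given (mu, sigma): a precision-weighted
# average of the observation and the population mean
shrink <- function(x.obs, e.obs, mu, sigma) {
  w <- (1 / e.obs^2) / (1 / e.obs^2 + 1 / sigma^2)  # weight of the datum
  w * x.obs + (1 - w) * mu
}

shrink(x.obs = 5, e.obs = 2.0, mu = 3, sigma = 0.5)  # noisy datum: pulled to ~3.12
shrink(x.obs = 5, e.obs = 0.1, mu = 3, sigma = 0.5)  # precise datum: stays at ~4.92
```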

In the former example we have modelled the mean \(x_i\) of each individual observation \(\hat x_i\) as well as the mean \(\mu\) at the population level. As a consequence, we obtain shrunken (regularized) posterior estimates for each \(x_i\). Let us compare the results with a non-hierarchical pooled estimation:

stanBGM2pooled <-'
  data {                     
    int<lower=0> N;            
    vector[N] obsx;           
    vector<lower=0>[N] sigmaobsx;       
  }
  parameters {                
    real mu;
  }
  model {                    
    // priors
    mu ~ normal(0, 10); 

    // likelihood
    for (n in 1:N) {
      obsx[n] ~ normal(mu, sigmaobsx[n]);
    }
  }
'
# Create the data list object
dat = list(N = N, obsx=x.obs.samples, sigmaobsx=x.obs.errors)


# Compile the model (a single iteration just builds the executable)
fit21 <- stan(model_code = stanBGM2pooled, chains = 1, data = dat, iter = 1, verbose = FALSE)

set.seed(12354)

fit22 = stan(fit=fit21 , data = dat,  iter = 5000  , warmup = 3000, chains = 3)
# summary
print(fit22, digits_summary=4)
Inference for Stan model: 862a9dfe0d272562936ee493dbcbae43.
3 chains, each with iter=5000; warmup=3000; thin=1; 
post-warmup draws per chain=2000, total post-warmup draws=6000.

           mean se_mean     sd       2.5%        25%        50%        75%      97.5% n_eff   Rhat
mu       2.8231  0.0001 0.0054     2.8125     2.8195     2.8231     2.8267     2.8335  1993 1.0001
lp__ -5602.8974  0.0145 0.7209 -5604.9294 -5603.0816 -5602.6151 -5602.4323 -5602.3840  2464 1.0002

Samples were drawn using NUTS(diag_e) at Mon Nov 13 19:53:49 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

We see that the hierarchical model improves the estimation of the true mean \(\mu=3\). In fact, the mean value of \(\mu\) inferred by the HBM is 3.0301 while the mean value inferred by the pooled estimator is 2.8231.
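To understand the pooled result, note that with a weak prior its posterior mean is essentially the inverse-variance-weighted mean of the observations; because the model ignores the intrinsic scatter \(\sigma\), a handful of points with (apparently) tiny uncertainties dominate the estimate. A sketch on data simulated as above:

```r
# The pooled posterior mean behaves like the inverse-variance-weighted mean,
# which concentrates almost all the weight on a few very small uncertainties
set.seed(7)
N <- 500
x.true <- rnorm(N, 3, 0.5)
e <- abs(rnorm(N, 0, 1))
x.obs <- rnorm(N, x.true, e)

w <- 1 / e^2
ivw <- sum(w * x.obs) / sum(w)
n.eff <- sum(w)^2 / sum(w^2)  # effective number of points carrying the weight
c(ivw, n.eff)                 # n.eff is tiny compared with N = 500
```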

1.4 Hierarchical Modelling for Regression

In this section we construct an HBM that infers the parameters of an intrinsic linear relation of the variable \(y\) on the variable \(x\), taking into account the measurement errors of both variables. We first simulate the data:

# Simulate the true and observed values of x
N <- 200
x.mu <- -0.27790
x.sigma <- 0.076

x.samples <- rnorm(N, x.mu, x.sigma)
x.obs.errors <- rnorm(N, 0, 0.1) 
x.obs.errors <- abs(x.obs.errors)
x.obs.samples <- rnorm(N, x.samples, x.obs.errors)  

# Simulate the true and observed values of y from the linear relation
y.slope <- -2.73
y.intercept <- -1.24
y.sigma <- 0.05
y.samples <- rnorm(N, y.slope*x.samples + y.intercept, y.sigma)
y.obs.errors <- rnorm(N, 0, 0.1)
y.obs.errors <- abs(y.obs.errors)
y.obs.samples <- rnorm(N, y.samples, y.obs.errors) 


plot(x.samples, y.samples, asp = 1 )
plot(x.obs.samples, y.obs.samples, asp = 1 )

hist(y.samples,breaks=100)
hist(10^(x.samples),breaks=100)
cor(x.samples,y.samples)
cor(x.obs.samples,y.obs.samples)
# Create the stan model object
stanmodelcode <-'
data {                      
  int<lower=0> N;           
  vector[N] obsx;              
  vector[N] obsy;             
  vector<lower=0>[N] sigmaobsx;       
  vector<lower=0>[N] sigmaobsy;        
}
parameters {                 
  real mu;
  real alpha;               
  real beta1;
  real<lower=0> sigma;
  real<lower=0> sigmax;
  vector[N] x;               
  vector[N] y;
}
model {                    
  // priors
  alpha ~ normal(0, 10);             
  beta1 ~ normal(0, 10);
  sigma ~ uniform(0, 10);   
  sigmax ~ uniform(0, 10);
 
  for (n in 1:N) {
    // population-level distributions of the true values
    x[n] ~ normal(mu, sigmax);
    y[n] ~ normal(alpha + beta1 * x[n], sigma);

    // likelihood (measurement model)
    obsx[n] ~ normal(x[n], sigmaobsx[n]);
    obsy[n] ~ normal(y[n], sigmaobsy[n]);
  }
}
'
# Create the data list object
dat = list(N = N, obsy=y.obs.samples, obsx=x.obs.samples,sigmaobsy=y.obs.errors,sigmaobsx=x.obs.errors)


fit1 <- stan(model_code = stanmodelcode, chains = 1, data = dat, iter = 1, verbose = FALSE)

set.seed(12354)

fit2 = stan(fit=fit1 , data = dat,  iter = 5000  , warmup = 3000, chains = 3)

# summary

print(fit2, digits_summary=4)
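An instructive comparison (not part of the original analysis) is a naive least-squares fit on the observed values: ignoring the measurement errors in \(x\) attenuates the slope towards zero, an effect known as regression dilution that the HBM above avoids. A self-contained sketch with the same simulation parameters:

```r
# Regression dilution: measurement errors in x attenuate the OLS slope
set.seed(1)
N <- 2000
x.true <- rnorm(N, 0, 0.076)
y <- -2.73 * x.true - 1.24 + rnorm(N, 0, 0.05)
x.obs <- x.true + rnorm(N, 0, 0.1)          # errors in the predictor

b.true <- coef(lm(y ~ x.true))[["x.true"]]  # ~ -2.73
b.obs  <- coef(lm(y ~ x.obs))[["x.obs"]]    # attenuated towards zero

# expected attenuation factor: var(x) / (var(x) + var(error)) = 0.366
b.obs / b.true
```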

2 Bayesian models to infer PL(Z) relations using Gaia parallaxes

In this section we construct an HBM to infer a Period-Luminosity-Metallicity relation for RR Lyrae ab-type stars using Gaia parallaxes and photometry in the K band.

We first import the local file RRLyrae200Dambis.dat into a data frame and compute some derived parameters:


full <- read.table("RRLyrae200Dambis.dat",header=TRUE,na.strings = "NA",
                   colClasses =c(rep("character",3),rep("numeric",6),"character",
                                 rep("numeric",2),rep("character",1),rep("numeric",11)) )


n <- nrow(full)
full$K0    <- with(full, X.Kmag. - 0.114*AV )  # dereddened K magnitude
full$em    <- with(full, e_.Kmag.) # does not include the uncertainty in the dereddening
full$elogP <- with(full, LogP/(LogP*100) )   # constant uncertainty of 0.01 in log10(P)
full$emet  <- with(full, 0.2*Met_Harris/(Met_Harris) )  # constant uncertainty of 0.2 in the metallicity
#attach(full)
str(full)

The columns of the data frame full include, for each of the \(n\) stars in the sample, the measured values \(\left(\hat{m}_{i},\log\hat{P}_{i},\hat{\varpi}_{i},\hat{Z}_{i},\hat{A}_{m_{i}}\right)\) and their uncertainties.

Now we are going to construct the HBM. Let us first present the resulting DAG associated with the model; in the following paragraphs we detail its construction.

We start modelling the stochastic linear relationship

\[ M_{i}=b+c\log P_{i}+kZ_{i} \]

where \(M_{i}\), \(\log P_{i}\) and \(Z_{i}\) are, respectively, the absolute magnitude, the decadic logarithm of the intrinsic period, and the intrinsic metallicity of the \(i\)-th star in the sample. Note that we do not know the values of the intrinsic (true) astrophysical parameters \(\log P_{i}\) and \(Z_{i}\), but we do know their measured values \(\log\hat{P}_{i}\) and \(\hat{Z}_{i}\) and the associated uncertainties. To account for the stochastic nature of the relation, we assign it an intrinsic width \(w\). The main objective of the model is then to infer the parameters \(b\), \(c\), \(k\) and \(w\), where the slopes are parameterized through angles as \(c=\tan\left(\phi_{1}\right)\) and \(k=\tan\left(\phi_{2}\right)\).

Now we need to relate absolute magnitudes to apparent magnitudes. For that we use the deterministic relation \[ m_{i}=M_{i}-5\log_{10}\varpi_{i}+10+A_{m_{i}}\,, \] where \(m_{i}\) is the true reddened apparent magnitude and \(\varpi_{i}\) is the true parallax measured in mas. Note, again, that we do not know the values of the true reddened apparent magnitudes and parallaxes, but we do know their measured values \(\hat{m}_{i}\) and \(\hat{\varpi}_{i}\) and uncertainties.
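As a quick numerical sanity check of this relation (the helper abs.mag is ours, not part of the model): for a parallax in mas, \(5\log_{10}\varpi-10\) is minus the distance modulus.

```r
# m = M - 5*log10(parallax_mas) + 10 + A  <=>  M = m + 5*log10(parallax_mas) - 10 - A
abs.mag <- function(m, plx.mas, A = 0) m + 5 * log10(plx.mas) - 10 - A

abs.mag(m = 5.0, plx.mas = 10)           # 100 pc: distance modulus 5, so M = 0
abs.mag(m = 10.0, plx.mas = 1)           # 1 kpc: distance modulus 10, so M = 0
abs.mag(m = 10.5, plx.mas = 1, A = 0.5)  # the extinction is subtracted too: M = 0
```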

At this point we place in the DAG associated with our HBM a plate (rectangle) that replicates over the \(n\) stars in the sample, and include in it the nodes corresponding to the extinctions \(A_{m_{i}}\) and the true astrophysical parameters \(m_{i}\), \(M_{i}\), \(\log P_{i}\) and \(Z_{i}\), together with the arcs between these nodes, which are given by the two relations above. For example, we draw solid (stochastic) arcs from \(\log P_{i}\) and \(Z_{i}\) to \(M_{i}\), and dashed (deterministic) arcs from \(M_{i}\) and \(\varpi_{i}\) to \(m_{i}\). Outside the plate, we include the parameters \(b\), \(\phi_1\), \(\phi_2\) and \(w\) of the PLZ relation and trace the arcs from those nodes to \(M_{i}\). We also set the distribution of the \(i\)-th absolute magnitude to a Gaussian of mean \(b+c\log P_{i}+kZ_{i}\) and standard deviation \(w\).

We continue the construction of the model and the DAG by linking the measured astrophysical parameters to the true ones. So, we include \(\hat{m}_{i}\), \(\log\hat{P}_{i}\), \(\hat{\varpi}_{i}\) and \(\hat{Z}_{i}\) inside the plate of the DAG and trace the corresponding solid (stochastic) arcs from the true parameters to the measured ones. To each measured astrophysical parameter we assign a Gaussian distribution with its mean equal to the value of the true parameter and its standard deviation equal to the uncertainty provided by our data set. For example, for the measured apparent magnitude we declare \(\hat{m}_{i}\sim\mathsf{N}\left(m_{i},\sigma_{m_{i}}\right)\).

Next, we assign prior probability distributions to the true astrophysical parameters \(\log P_{i}\), \(Z_{i}\) and \(\varpi_{i}\) and trace the DAG arcs from the parameters of those distributions to the astrophysical parameters. We set \(\log P_{i}\sim\mathsf{N}\left(\mu_{P}=0,\sigma_{P}=2\right)\), \(Z_{i}\sim\mathsf{N}\left(\mu_{Z}=0,\sigma_{Z}=5\right)\) and \(\varpi_{i}\sim\mathsf{logN}\left(\beta,\gamma\right)\). Note that the hyperparameters of the \(\log P_{i}\) and \(Z_{i}\) distributions are constant quantities, while the hyperparameters \(\beta\) and \(\gamma\) of the \(\varpi_{i}\) distribution are random. For the latter we assign the hyperpriors \(\beta\sim\mathsf{N}\left(\mu_{\beta}=0,\sigma_{\beta}=2\right)\) and \(\gamma\sim\mathsf{Exp}\left(1\right)\).

To finish the BGM construction, we assign the following prior distributions to the parameters of the PLZ relation: \(b\sim\mathsf{N}\left(\mu_{b}=0,\sigma_{b}=3\right)\) for the intercept, \(\phi_{1},\phi_{2}\sim\mathsf{N}\left(\mu_{\phi}=0,\sigma_{\phi}=\pi/2\right)\) for the slope angles and \(w\sim\mathsf{Exp}\left(\lambda_{w}=1\right)\) for the intrinsic width.

We define now the Stan model for our problem:

stanPLZcode <-'
data {
  int<lower=1> N; // number of observations
  real pi[N];     // observed parallax
  real m[N];      // observed apparent magnitude 
  real logP[N];   // decadic logarithm of the observed period 
  real met[N];    // observed metallicity 
  real AV[N];     // V-band extinction
  
  real epi[N];    // measurement error of the observed parallax
  real em[N];     // measurement error of the observed (reddened) apparent magnitude 
  real elogP[N];  // measurement error of the logarithm of the observed period 
  real ealpha[N]; // measurement error of the observed right ascension (unused)
  real edelta[N]; // measurement error of the observed declination (unused)
  real emet[N];   // uncertainty of the observed metallicity 
}
parameters {
  vector<lower=0>[N] pi0;   // true parallax
  vector[N] logP0;          // decadic logarithm of the true period 
  vector[N] met0;           // true metallicity 
  vector[N] M;              // true absolute magnitude
  real angle;               // slope angle of the log(P) term
  real b;                   // intercept
  real c;                   // slope angle of the metallicity term
  real beta;                // location of the parallax prior
  real gamma;               // scale of the parallax prior
  real width;               // intrinsic width of the PLZ relation
}
model {
  real a;
  real d;

  // hyperpriors and priors
  width ~ exponential(1);
  beta ~ normal(0.0, 2.0); 
  gamma ~ exponential(1);
  angle ~ normal(0, 3.1416/2.0);
  b ~ normal(0, 3.0);
  c ~ normal(0, 3.1416/2.0);
  a = tan(angle); // slope of the log(P) term
  d = tan(c);     // slope of the metallicity term

  for (i in 1:N) {
    pi0[i] ~ lognormal(beta, gamma);
    logP0[i] ~ normal(0.0, 2);
    met0[i] ~ normal(0, 5);
    M[i] ~ normal(a*logP0[i] + d*met0[i] + b, width); 
    // likelihood
    m[i] ~ normal(M[i] - (5.0*log10(pi0[i])) + 10.0 + AV[i], em[i]);
    pi[i] ~ normal(pi0[i], epi[i]);
    logP[i] ~ normal(logP0[i], elogP[i]);
    met[i] ~ normal(met0[i], emet[i]);
  }

  // Alternative (commented out): marginalize the absolute magnitudes analytically
  // for (i in 1:N) {
  //   target += normal_lpdf(m[i] | (a*logP0[i]+d*met0[i]+b) + 10.0 - (5.0*log10(pi0[i])), emtot[i]+width) +
  //             normal_lpdf(pi[i] | pi0[i], epi[i]) +
  //             normal_lpdf(logP[i] | logP0[i], elogP[i]) +
  //             normal_lpdf(met[i] | met0[i], emet[i]);
  // }
}
'

We load the training set into a list, define a function that initializes the Markov chains, call the sampler and store the fit.

# Create data list object

linear_data <- list(N=nrow(full), pi=full$parallax, epi=full$parallax_error,
                    m=full$K0, em=full$em,
                    logP=full$LogP, elogP=full$elogP,
                    alpha=full$ra, delta=full$dec,
                    ealpha=full$ra_error, edelta=full$dec_error,
                    met=full$Met_Harris, emet=full$emet,
                    AV=full$AV 
                    )


str(linear_data)


initf2 <- function(chain_id = 1) {
  set.seed(chain_id)

# Create empty variables
  initpi = rep(NA,n)
  initlogp = rep(NA,n)
  initmet = rep(NA,n)
  initM = rep(NA,n)
# Initialize parameters
  m=rnorm(1,0,1.0) 
  angle = atan(m)
  b=rnorm(1,-1.0,0.3)
  c = rnorm(1,0,0.5)
  d = atan(c)

  width=rexp(1,1.0)
  beta=rnorm(1,0.0,2.0)
  gamma=rexp(1,1.0)
  eAV=rexp(1,1.0)
  
  for (i in 1:n) {
    initlogp[i] = rnorm(1,full$LogP[i],0.01)
    initmet[i] = rnorm(1,full$Met_Harris[i],0.5)
    initM[i] = rnorm(1, m*full$LogP[i] + c*full$Met_Harris[i] + b, width)
    initpi[i] =  10^((initM[i]- full$K0[i]+10.0)/5.0) # Initialize almost at random (m,b,d can take almost any value)
    
    
  }



 list(pi0=initpi,
         logP0=initlogp, M=initM,
          angle=angle,
          b=b, beta=beta, gamma=gamma, c=c, width=width,eAV=eAV)
 
 #list(angle=angle, b=b, beta=beta, gamma=gamma, c=c, width=width)
 
 

} 

# generate a list of lists to specify initial values
#n_chains <- 4
n_chains <- 1
init_ll <- lapply(1:n_chains, function(id) initf2(chain_id = id))

nsamples_chain <- 5000
nsamples <- n_chains*nsamples_chain/2 # post-warmup draws (warmup defaults to iter/2)

fit <- stan(model_code = stanPLZcode, data = linear_data, iter = nsamples_chain, 
            chains = n_chains,init=init_ll,control= list(adapt_delta = 0.8), verbose = TRUE,init_r=2)

#fit <- stan(model_code = stanPLZcode, data = linear_data, iter = nsamples_chain, 
#            chains = n_chains,init="random",control= list(adapt_delta = 0.8), verbose = TRUE,init_r=2)

#draws <- extract(fit)

Finally, we plot the marginal posterior distributions from the MCMC samples. The diagonal shows the one-dimensional marginal distributions of the intrinsic width of the relation, the angle of the metallicity term in the linear relation, the intercept and the angle of the \(\log P\) term.

pairs(fit, pars = c("width", "c", "b","angle"), labels=c("Intrinsic Width", "Metallicity Angle", "Intercept","log(P) Angle"), 
      log = FALSE, las = 1)

---
title: "R notebook tutorial about hierarchical Bayesian models to Infer PL(Z) relations using Gaia parallaxes"
author: 
- Héctor E. Delgado
- Luis M. Sarro
date:  "`r format(Sys.time(), '%d/%m/%Y')`"
header-includes: 
  - \usepackage{tikz}
  - \usepackage{pgfplots}
output: 
    html_notebook:
      toc: true
      number_sections: true
---

<style>
body {
text-align: justify}
</style>



In this tutorial, we first briefly introduce the Bayesian paradigm for statistical inference and the concept of hierarchical Bayesian graphical model. We simulate some toy problems with R and illustrate how we can easily represent them and infer their parameters of interest by means of the Stan programming language and its interface Rstan for R. The second part of the tutorial is devoted to use the Bayesian hierarchical methodology to derive a set of Period-Luminosity-Metallicity relations from Gaia parallaxes and photometry in several bands.     



# Bayesian Graphical Models

## The Bayesian paradigm 

Bayesian inference is based on the Bayes rule:

$$
p\left(\Theta\mid\mathcal{D}\right)=\frac{p\left(\mathcal{D}\mid\Theta\right)\times p\left(\Theta\right)}{p\left(\mathcal{D}\right)}\,,
$$
where $\mathcal{D}$ are the observations (data), $\Theta$ are the parameters of a model proposed to explain the data and $p$'s represent (discrete or continuous) probability distributions. Given that the probability of the data is constant, we can drop it and rewrite the former expression as

$$
p\left(\Theta\mid\mathcal{D}\right)\propto p\left(\mathcal{D}\mid\Theta\right)\times p\left(\Theta\right)\,,
$$
where the right side represents the model, specified by the joint probability distribution $p\left(\mathcal{D},\Theta\right)$ of the data and the parameters, which factorizes in:

* The conditional distribution $p\left(\mathcal{D}\mid\Theta\right)$ of the data given the parameters, the so called  *likelihood *, which is a function of the parameters.
* The *prior distribution* $p\left(\Theta\right)$ of the parameters, which represents our knowledge about the parameters before observing the data.

The left side $p\left(\Theta\mid\mathcal{D}\right)$ represents what we infer from the model, which is the *joint posterior distribution* of the parameters given the data. Computing this distribution requires computing the normalizing constant $p\left(\mathcal{D}\right)$, which, in many cases, is an analytically untreatable task. Furthermore, if we only are interested in a subset $\Phi$ of $\left( \Theta  \right)=\left( \Phi ,\Lambda  \right)$, then we are faced to integrate (marginalize) the joint posterior distribution over the set $\Lambda$ of *nuisance parameters*, which can be challenging again. Fortunately, these two obstacles may be circumvented by sampling from the posterior distribution by means of Markov Chain Monte Carlo (MCMC) simulation technics.   



## A toy problem, DAG models and Stan 

Let us now consider a very simple problem: To infer the mean $\mu$ of a $\mathsf{N}\left(\mu=3,\sigma=0.5\right)$ assuming that we known its standard deviation $\sigma=0.5$. We first simulate some data:

```{r}
N <- 3000
x.mu <- 3
x.sigma <- 0.5
x.samples <- rnorm(N, x.mu , x.sigma)
hist(x.samples,breaks=100,xlab="x",main="",freq=FALSE)
```

A model for this problem is:
$$
p\left(\mathcal{D},\mu\right)=\prod_{i=1}^{N}p\left(\hat{x}_{i}\mid\mu\right)\times p\left(\mu\right)
$$
where $\hat{x}_{i}\mid\mu\sim\mathsf{N}\left(\mu,4\right)$, $\mu\sim\mathsf{N}\left(0,\sigma_{0}\right)$, with $\sim$ meaning *distributed as*, and $\sigma_0$ is a hyperparameter that represents our uncertainty about our prior knowledge about the true value of the mean, which we assume to be 0. 

For a better understanding of the  dependency structure between variables it is customary to represent the model by using the Bayesian network formalism, which consists of drawing a DAG (directed acyclic graph) whose topology represents the factorization of the joint probability distribution $p\left(\mathcal{D},\Theta\right)$. So, we assign to each node of the DAG a factor distribution of $p\left(\mathcal{D},\Theta\right)$. If the factor distribution is a conditional distribution, then the node parents are the parameters of the distribution and if the factor distribution is an unconditional (marginal) distribution, the node has no parents.    


Now, to declare and draw the DAG corresponding to our toy model we use the [dot language](http://www.graphviz.org/Documentation/dotguide.pdf) and call the `grViz` function of the `DiagrammeR` package.   

```{r}
library(DiagrammeR)
```

```{r}
grViz("
  digraph BGMexample1 {

    margin=0;
    ratio=0.6; 
    compound=true; 
    node[style=filled,shape=circle,color=black, fillcolor=cadetblue3,fixedsize=false];
    rankdir =TB;
    
    subgraph cluster0 {
      
      label='N';

      # Observations
      # https://github.com/rich-iannone/DiagrammeR/issues/71
      x_rec[label='x&#770;@_{i}',shape=doublecircle]; 
    }

    # Parameters
    mu[label='&mu;'];
    sigma[label='&sigma;',shape=square];

   # Links to observed nodes
    mu -> x_rec;
    sigma -> x_rec;
  }
 ", width = 400, height = 400  )
```

To do the Bayesian analysis we need a probabilistic modelling language in which we can declare our model and a MCMC sampler to perform the inference. In this tutorial we use the [Stan language](http://mc-stan.org/users/documentation/index.html) and its default sampler NUTS (No-U-Turn sampler). To run Stan from R we load the interface [rstan](https://cran.r-project.org/web/packages/rstan/rstan.pdf):


```{r}
library(rstan)
library(ggplot2)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
```

We define now the Stan model for our toy problem:

```{r}
stanBGM1 <-'
  data {                     
    int<lower=0> N;            
    vector[N] obsx;           
    real<lower=0> sigma;
    real<lower=0> sigma0;
  }
  parameters {                
    real mu;
  }
  model {                    
    //prior
    mu ~ normal(0, sigma0);             
 
    //likelihood
    for (n in 1:N) {
      obsx[n] ~ normal(mu, sigma);
    }
  }
'
```

Now, we load our data into a list, call the sampler and store the results of the fit:


```{r}
# Load data 
dat = list(N = N, obsx=x.samples, sigma=x.sigma, sigma0=100)

set.seed(12354)


fit1 <- stan(model_code = stanBGM1, data = dat, chains = 3,  iter = 6000  , warmup = 3000, verbose = FALSE)

```



Finally, we print the results and plot the Markov chains and a histogram corresponding to the posterior samples of $\mu$:  

```{r}
# summary
print(fit1, digits_summary=4)
stan_trace(fit1,pars=c("mu"),unconstrain = FALSE)
stan_hist(fit1,pars=c("mu"),unconstrain = FALSE)
```





## Hierarchical Modelling


A hierarchical Bayesian model (HBM) divides the variability of a statistical problem into several levels. 

Consider the following problem: To infer the mean $\mu$ and the standard deviation $\sigma$ of a Gaussian population from $N$ samples of observed values with their corresponding uncertainties. First, let us to simulate some data:


```{r}
N <- 500
x.mu <- 3
x.sigma <- 0.5
x.samples <- rnorm(N, x.mu , x.sigma)
x.obs.errors <- rnorm(N,0,1) 
x.obs.errors <- abs(x.obs.errors)
x.obs.samples <- rnorm(N, x.samples, x.obs.errors)  
hist(x.samples,breaks=100,xlab="x",main="",freq=FALSE,xlim=c(0,6),ylim=c(0,1))
hist(x.obs.samples,breaks=100,xlab="x",main="",freq=FALSE,xlim=c(0,6),ylim=c(0,1))
```


A suitable model for this problem is:

$$
p\left(\mathcal{D},\Theta\right)=p\left(\mu\right)\cdot p\left(\sigma\right)\prod_{i=1}^{N}p\left(\hat{x}_{i}\mid x_{i}\right)\cdot p\left(x_{i}\mid\mu,\sigma\right)
$$

where $\Theta=\left(\mu,\sigma,x_{1},\ldots,x_{N}\right)$, $\hat{x}_{i}\mid x_{i}\sim\mathsf{N}\left(x_{i},\sigma_{x_{i}}\right)$, $x_{i}\mid\mu,\sigma\sim\mathsf{N}\left(\mu,\sigma\right)$, $\mu\sim\mathsf{N}\left(\mu_{0},\sigma_{0}\right)$ and $\sigma\sim\mathsf{U}\left(0,k\right)$. That is, we distinguish between the intrinsic variability of the statistical problem (the variability at the population level) and the variability of each individual observation.
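The factorization above corresponds to a simple sum of log-densities. Stan builds this function for us, but as an illustration the unnormalized log-posterior can be written directly in R (the constants `mu0`, `sigma0` and `k` stand for the hyperparameters of the $\mu$ and $\sigma$ priors; the values below are illustrative):

```r
# Unnormalized log-posterior of the HBM:
# p(mu) p(sigma) prod_i p(xhat_i | x_i) p(x_i | mu, sigma)
log.posterior <- function(mu, sigma, x, x.hat, x.hat.sd,
                          mu0 = 0, sigma0 = 10, k = 10) {
  if (sigma <= 0 || sigma > k) return(-Inf)                  # sigma ~ U(0, k)
  sum(dnorm(x.hat, mean = x,  sd = x.hat.sd, log = TRUE)) +  # likelihood terms
    sum(dnorm(x,   mean = mu, sd = sigma,    log = TRUE)) +  # population prior
    dnorm(mu, mean = mu0, sd = sigma0, log = TRUE)           # hyperprior on mu
}
```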


Now we declare and draw the DAG corresponding to our HBM (`grViz` is provided by the DiagrammeR package):

```{r}
library(DiagrammeR)  # provides grViz()
grViz("
  digraph BGMexample2 {

    margin=0;
    ratio=0.6; 
    compound=true; 
    node[style=filled,shape=circle,color=black, fillcolor=gray,fixedsize=false];
    rankdir =TB;
    
    subgraph cluster0 {
      
      label='N';
      
      # Parameters
      x[label='x@_{i}'];


      # Observations
      # https://github.com/rich-iannone/DiagrammeR/issues/71
      x_rec[label='x&#770;@_{i}',shape=doublecircle,fillcolor=deepskyblue1]; 
    }

    # Hyperparameters
    mu[label='&mu;',fillcolor=darkseagreen3];
    sigma[label='&sigma;',fillcolor=darkseagreen3];

   # Links 
    mu -> x;
    sigma -> x;
    x -> x_rec;
  }
 ", width = 600, height = 600  )
```

We define now the Stan model for our second problem:

```{r}
stanBGM2 <-'
  data {                     
    int<lower=0> N;            
    vector[N] obsx;           
    vector<lower=0>[N] sigmaobsx;       
  }
  parameters {                
    real mu;
    real<lower=0,upper=10> sigma;  // bounds match the uniform prior below
    vector[N] x;   
  }
  model {                    
    // hyperpriors
    mu ~ normal(0, 10); 
    sigma ~ uniform(0, 10);    

    // priors
    for (n in 1:N) {
      x[n] ~ normal(mu, sigma);

      // likelihood
      obsx[n] ~ normal(x[n], sigmaobsx[n]);
    }
  }
'
```

Finally, we call the sampler and print summary statistics of the posterior samples for the parameters $\mu$ and $\sigma$.

```{r}
# Create the data list object
dat = list(N = N, obsx=x.obs.samples, sigmaobsx=x.obs.errors)


# Compile the model first (iter = 1 just triggers compilation)
fit1 <- stan(model_code = stanBGM2, chains = 1, data = dat, iter = 1, verbose = FALSE)

set.seed(12354)

fit2 = stan(fit=fit1 , data = dat,  iter = 5000  , warmup = 3000, chains = 3)

```

```{r}
# summary
print(fit2, pars=c("mu","sigma"), digits_summary=4)
```


In the previous example we modelled the mean $x_i$ of each individual observation $\hat x_i$ as well as the mean $\mu$ at the population level. As a consequence, we obtain shrunken (regularized) posterior estimates for each $x_i$. Let us compare the results with a non-hierarchical pooled estimate:

```{r}
stanBGM2pooled <-'
  data {                     
    int<lower=0> N;            
    vector[N] obsx;           
    vector<lower=0>[N] sigmaobsx;       
  }
  parameters {                
    real mu;
  }
  model {                    
    // prior
    mu ~ normal(0, 10); 

    for (n in 1:N) {
      // likelihood
      obsx[n] ~ normal(mu, sigmaobsx[n]);
    }
  }
'
```

```{r}
# Create the data list object
dat = list(N = N, obsx=x.obs.samples, sigmaobsx=x.obs.errors)


# Compile the model first (iter = 1 just triggers compilation)
fit21 <- stan(model_code = stanBGM2pooled, chains = 1, data = dat, iter = 1, verbose = FALSE)

set.seed(12354)

fit22 = stan(fit=fit21 , data = dat,  iter = 5000  , warmup = 3000, chains = 3)


```

```{r}
# summary
print(fit22, digits_summary=4)
```


We see that the hierarchical model improves the estimation of the true mean $\mu=$ `r x.mu`. In fact, the mean value of $\mu$ inferred by the HBM is `r summary(fit2)$summary[,"mean"]["mu"]` while the mean value inferred by the pooled estimator is `r summary(fit22)$summary[,"mean"]["mu"]`.
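The shrinkage can also be seen analytically: conditional on $\mu$ and $\sigma$, the posterior mean of each $x_i$ is a precision-weighted average of the observation and the population mean. A self-contained illustration (using the true simulation values of $\mu$ and $\sigma$):

```r
# E[x_i | xhat_i, mu, sigma] = w_i * xhat_i + (1 - w_i) * mu,
# with weight w_i = sigma^2 / (sigma^2 + sigmaobs_i^2)
set.seed(1)
mu <- 3; sigma <- 0.5
x.true   <- rnorm(10, mu, sigma)
x.sd.obs <- abs(rnorm(10, 0, 1))        # heterogeneous observational errors
x.hat    <- rnorm(10, x.true, x.sd.obs)
w        <- sigma^2 / (sigma^2 + x.sd.obs^2)
x.shrunk <- w * x.hat + (1 - w) * mu
# Noisy observations are pulled strongly towards mu, precise ones barely move:
cbind(x.hat, x.shrunk, weight = round(w, 2))
```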


## Hierarchical Modelling for Regression

In this section we construct an HBM that infers the parameters of an intrinsic linear relation between the variables $x$ and $y$, taking into account the measurement errors in both.



```{r}
# Simulate the true (latent) x values and their noisy observations
N = 200
x.mu = -0.27790
x.sigma = 0.076

x.samples = rnorm(N, x.mu , x.sigma)
x.obs.errors <- rnorm(N,0,0.1) 
x.obs.errors <- abs(x.obs.errors)
x.obs.samples <- rnorm(N, x.samples, x.obs.errors)  

# Simulate y from the intrinsic linear relation and add observational noise
y.slope = -2.73
y.intercept = -1.24
y.sigma = 0.05
y.samples = rnorm(N, y.slope*x.samples + y.intercept , y.sigma)
y.obs.errors <- rnorm(N,0,0.1)
y.obs.errors <- abs(y.obs.errors)
y.obs.samples <- rnorm(N, y.samples, y.obs.errors) 


plot(x.samples, y.samples, asp = 1 )
plot(x.obs.samples, y.obs.samples, asp = 1 )

hist(y.samples,breaks=100)
hist(10^(x.samples),breaks=100)
cor(x.samples,y.samples)
cor(x.obs.samples,y.obs.samples)
```
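Before writing the hierarchical model, it is instructive to see why an ordinary least-squares fit on the observed values is not enough: noise in the predictor attenuates the estimated slope towards zero (regression dilution), roughly by the factor $\sigma_x^2/(\sigma_x^2+\sigma_{\mathrm{err}}^2)$. A sketch with the same simulation values, simplified to a constant error of 0.1 in $x$:

```r
# Regression dilution: measurement noise in x biases the OLS slope towards 0
set.seed(1)
N <- 200
x     <- rnorm(N, -0.278, 0.076)           # true predictor
y     <- rnorm(N, -2.73 * x - 1.24, 0.05)  # intrinsic linear relation
x.obs <- rnorm(N, x, 0.1)                  # x measured with error
coef(lm(y ~ x))["x"]        # close to the true slope -2.73
coef(lm(y ~ x.obs))["x"]    # attenuated towards zero
```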




```{r}
# Create the stan model object
stanmodelcode <-'
data {                      
  int<lower=0> N;           
  vector[N] obsx;              
  vector[N] obsy;             
  vector<lower=0>[N] sigmaobsx;       
  vector<lower=0>[N] sigmaobsy;        
}
parameters {                 
  real mu;
  real alpha;               
  real beta1;
  real<lower=0,upper=10> sigma;   // bounds match the uniform priors below
  real<lower=0,upper=10> sigmax;
  vector[N] x;               
  vector[N] y;
}
model {                    
  //priors
  mu ~ normal(0, 10);            // prior on the mean of the latent x values
  alpha ~ normal(0, 10);             
  beta1 ~ normal(0, 10);
  sigma ~ uniform(0, 10);   
  sigmax ~ uniform(0, 10);
 
//likelihood
  for (n in 1:N) {
    x[n] ~ normal(mu, sigmax);
    y[n] ~ normal(alpha + beta1 * x[n], sigma);
    
    // Likelihood components

    obsx[n] ~ normal(x[n], sigmaobsx[n]);
    obsy[n] ~ normal(y[n], sigmaobsy[n]);

  }
}
'
```


```{r}
# Create the data list object
dat = list(N = N, obsy=y.obs.samples, obsx=x.obs.samples,sigmaobsy=y.obs.errors,sigmaobsx=x.obs.errors)


fit1 <- stan(model_code = stanmodelcode, chains = 1, data = dat, iter = 1, verbose = FALSE)

set.seed(12354)

fit2 = stan(fit=fit1 , data = dat,  iter = 5000  , warmup = 3000, chains = 3)

# summary

print(fit2, digits_summary=4)
```




# Bayesian models to Infer PL(Z) relations using Gaia parallaxes

In this section we construct an HBM to infer a Period-Luminosity-Metallicity relation for RR Lyrae ab-type stars using Gaia parallaxes and photometry in the K band. 

We first import the local file `RRLyrae200Dambis.dat` into a data frame and compute some derived parameters:


```{r}

full <- read.table("RRLyrae200Dambis.dat",header=TRUE,na.strings = "NA",
                   colClasses =c(rep("character",3),rep("numeric",6),"character",
                                 rep("numeric",2),rep("character",1),rep("numeric",11)) )


n <- nrow(full)
full$K0    <- with(full, X.Kmag. - 0.114*AV ) 
full$em    <- with(full, e_.Kmag.) # Does not include uncertainty in dereddening
full$elogP <- with(full, LogP/(LogP*100) )   # constant 0.01 uncertainty in log(P)
full$emet  <- with(full, 0.2*Met_Harris/(Met_Harris) )   # constant 0.2 dex uncertainty in metallicity
#attach(full)
str(full)
```



Columns of the data frame `full` include, for the `n` stars in the sample:

* The (measured) K-band apparent magnitudes (`K0`), decadic logarithm of periods  (`LogP`), parallaxes (`parallax`),  metallicities (`Met_Harris`) and extinctions (`AV`).
* The uncertainties of the former quantities: `em`, `elogP`, `parallax_error` and `emet`.


That is, for each star we have the vector of measurements $\left(\hat{m}_{i},\log\hat{P}_{i},\hat{\varpi}_{i},\hat{Z}_{i},\hat{A}_{m_{i}}\right)$.

Now we are going to construct the HBM. Let us first present the resulting DAG associated with the model; in the following paragraphs we detail its construction.

<embed src="graph.svg" type="image/svg+xml" />


We start modelling the stochastic linear relationship 
 
$$
M_{i}=b+c\log P_{i}+kZ_{i}
$$  

where $M_{i}$, $\log P_{i}$ and $Z_{i}$ are, respectively, the absolute magnitude, the decadic logarithm of the intrinsic period and the intrinsic metallicity of the i-th star in the sample. Note that we do not know the values of the intrinsic (true) astrophysical parameters $\log P_{i}$ and $Z_{i}$, but we do know their measured values $\log\hat{P}_{i}$ and $\hat{Z}_{i}$ and associated uncertainties. To account for the stochastic nature of the former relation, we assign it an intrinsic width $w$. The main objective of the model is then to infer the parameters $b$, $c$, $k$ and $w$, where $c=\tan\left(\phi_{1}\right)$ and $k=\tan\left(\phi_{2}\right)$.

Now, we need to relate absolute magnitudes to apparent magnitudes. For that we use the deterministic relation
$$
m_{i}=M_{i}-5\log\varpi_{i}+10+A_{m_{i}}
$$
where $m_{i}$ is the true reddened apparent magnitude and $\varpi_{i}$ is the true parallax measured in mas. Note, again, that we do not know the true reddened apparent magnitudes and parallaxes, but we do know their measured values $\hat{m}_{i}$ and $\hat{\varpi}_{i}$ and the associated uncertainties.
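Inverting this relation gives the absolute magnitude in terms of the measured quantities; a small helper sketch (the function name `abs.mag` is ours):

```r
# m = M - 5*log10(parallax [mas]) + 10 + A  =>  M = m + 5*log10(parallax) - 10 - A
abs.mag <- function(m, parallax.mas, A = 0) {
  m + 5 * log10(parallax.mas) - 10 - A
}
abs.mag(m = 10, parallax.mas = 1)   # a star at 1 kpc: M = m - 10 = 0
```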


At this point we add to the DAG of our HBM a rectangle (plate) that replicates over the $n$ stars in the sample, and include in it the nodes corresponding to the extinctions $A_{m_{i}}$ and the true astrophysical parameters $m_{i}$, $M_{i}$, $\log P_{i}$ and $Z_{i}$, together with the arcs between these nodes, which are given by the two relations above. For example, we draw solid (stochastic) arcs from $\log P_{i}$ and $Z_{i}$ to $M_{i}$, and dashed (deterministic) arcs from $M_{i}$ and $\varpi_{i}$ to $m_{i}$. Outside the rectangle, we include the parameters $b$, $\phi_1$, $\phi_2$ and $w$ of the PLZ relation and trace arcs from those nodes to $M_{i}$. We also set the distribution of the i-th absolute magnitude to a Gaussian with mean $b+c\log P_{i}+kZ_{i}$ and standard deviation $w$.

We continue the construction of the model and the DAG by linking the measured astrophysical parameters to the true ones. We include $\hat{m}_{i}$, $\log\hat{P}_{i}$, $\hat{\varpi}_{i}$ and $\hat{Z}_{i}$ inside the rectangle of the DAG and trace solid (stochastic) arcs from the true parameters to their measured counterparts. To each measured astrophysical parameter we assign a Gaussian distribution with mean equal to the value of the true parameter and standard deviation equal to the uncertainty provided by our data set. For example, for the measured apparent magnitude we declare $\hat{m}_{i}\sim\mathsf{N}\left(m_{i},\sigma_{m_{i}}\right)$.  

Next, we assign prior probability distributions to the true astrophysical parameters $\log P_{i}$, $Z_{i}$ and $\varpi_{i}$, and trace the DAG arcs from the hyperparameters of these distributions to the astrophysical parameters. We set $\log P_{i}\sim\mathsf{N}\left(\mu_{P}=0,\sigma_{P}=2\right)$, $Z_{i}\sim\mathsf{N}\left(\mu_{Z}=0,\sigma_{Z}=5\right)$ and $\varpi_{i}\sim\mathsf{logN}\left(\beta,\gamma\right)$. Note that the hyperparameters of the $\log P_{i}$ and $Z_{i}$ distributions are constant quantities, while the hyperparameters $\beta$ and $\gamma$ of the $\varpi_{i}$ distribution are random. For the latter we assign the hyperpriors $\beta\sim\mathsf{N}\left(\mu_{\beta}=0,\sigma_{\beta}=2\right)$ and $\gamma\sim\mathsf{Exp}\left(1\right)$.    
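A quick way to check that these hyperpriors are sensible is to draw parallaxes from the prior predictive distribution; a sketch:

```r
# Prior predictive for the parallaxes:
# beta ~ N(0, 2), gamma ~ Exp(1), parallax_i ~ logN(beta, gamma)
set.seed(1)
n.draws  <- 10000
beta     <- rnorm(n.draws, 0, 2)
gamma    <- rexp(n.draws, 1)
parallax <- rlnorm(n.draws, meanlog = beta, sdlog = gamma)
summary(parallax)   # strictly positive and heavy-tailed, as expected
```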

To finish the BGM construction, we assign the following prior distributions to the parameters of the PLZ relation: $b\sim\mathsf{N}\left(\mu_{b}=0,\sigma_{b}=3\right)$ for the intercept, $\phi_{1},\phi_{2}\sim\mathsf{N}\left(\mu_{\phi}=0,\sigma_{\phi}=\pi/2\right)$ for the slope angles and $w\sim\mathsf{Exp}\left(\lambda_{w}=1\right)$ for the spread. 


We define now the Stan model for our problem:

```{r}
stanPLZcode <-'
data {
  int<lower=1> N; // number of observations
  real pi[N];     // observed parallax
  real m[N];      // observed apparent magnitude 
  real logP[N];   // decadic logarithm of observed period 
  //real alpha[N];  // observed right ascension
  //real delta[N];  // observed declination
  real met[N];    // observed metallicity 
  real AV[N];     // extinction
  
  real epi[N];    // measurement error of observed parallax
  real em[N];     // measurement error of observed (reddened) apparent magnitude 
  real elogP[N];  // measurement error of logarithm of observed period 
  real ealpha[N]; // measurement error of observed right ascension
  real edelta[N]; // measurement error of observed declination
  real emet[N];   // uncertainty of observed metallicity 
}

parameters {
  vector<lower=0>[N] pi0;   // true parallax
  vector[N] logP0;          // decadic logarithm of true period 
  vector[N] met0;           // true metallicity 
  vector[N] M;              // true absolute magnitude
  real angle;               // slope angle of the log(P) term (phi_1)
  real b;                   // intercept of the PLZ relation
  real c;                   // slope angle of the metallicity term (phi_2)
  real beta;                // location hyperparameter of the parallax prior
  real gamma;               // scale hyperparameter of the parallax prior
  real width;               // intrinsic width of the PLZ relation
  //real eAV;
}

model {

  //real ps;
  real a;
  real d;
  vector[N] emtot;
  //real emtot[N];

  // hyperpriors and priors on the PLZ parameters
  width ~ exponential(1);
  beta ~ normal(0.0,2.0); 
  gamma ~ exponential(1);
  //eAV ~ normal(0,0.3); // RELATIVE error in extinction
  
  angle ~ normal(0,3.1416/2.0);
  b ~ normal(0,3.0);
  //c ~ normal(0,0.5);
  c ~ normal(0,3.1416/2.0);

  a = tan(angle);  // slope of the log(P) term
  d = tan(c);      // slope of the metallicity term

  for (i in 1:N) {
    // priors on the true astrophysical parameters
    pi0[i] ~ lognormal(beta,gamma);
    logP0[i] ~ normal(0.0,2);
    met0[i] ~ normal(0,5);
    M[i] ~ normal(a*logP0[i] + d*met0[i] + b, width); 
    //emtot[i] = sqrt(em[i]^2 + (0.114*AV[i]*fabs(eAV))^2);  

    //AV[i] ~ normal(0,0.2);

    // likelihood
    m[i] ~ normal(M[i] - (5.0*log10(pi0[i])) + 10.0 + AV[i], em[i]);
    pi[i] ~ normal(pi0[i], epi[i]);
    logP[i] ~ normal(logP0[i],elogP[i]);
    met[i] ~ normal(met0[i],emet[i]);
  }

  // Alternative (unused) formulation adding the log-densities explicitly:
  //for (i in 1:N) {
  //      ps = 
  //        normal_lpdf(m[i]|(a*logP0[i]+d*met0[i]+b)+10.0-(5.0*log10(pi0[i]))  ,emtot[i]+width)+
  //            normal_lpdf(pi[i]|pi0[i],epi[i])+
  //            normal_lpdf(logP[i]|logP0[i],elogP[i])+
  //            normal_lpdf(met[i]|met0[i],emet[i]);
  //      target += ps;
  //  }

}
'
```




We load the training set into a list, define a function that initializes the Markov chains, call the sampler and store the fit.

```{r}
# Create data list object

linear_data <- list(N=nrow(full), pi=full$parallax, epi=full$parallax_error,
                    m=full$K0, em=full$em,
                    logP=full$LogP, elogP=full$elogP,
                    alpha=full$ra, delta=full$dec,
                    ealpha=full$ra_error, edelta=full$dec_error,
                    met=full$Met_Harris, emet=full$emet,
                    AV=full$AV 
                    )


str(linear_data)


initf2 <- function(chain_id = 1) {
  set.seed(chain_id)

# Create empty variables
  initpi = rep(NA,n)
  initlogp = rep(NA,n)
  initmet = rep(NA,n)
  initM = rep(NA,n)
# Initialize parameters
  m=rnorm(1,0,1.0) 
  angle = atan(m)
  b=rnorm(1,-1.0,0.3)
  c = rnorm(1,0,0.5)
  d = atan(c)

  width=rexp(1,1.0)
  beta=rnorm(1,0.0,2.0)
  gamma=rexp(1,1.0)
  eAV=rexp(1,1.0)
  
  for (i in 1:n) {
    initlogp[i] = rnorm(1,full$LogP[i],0.01)
    initmet[i] = rnorm(1,full$Met_Harris[i],0.5)
    initM[i] = rnorm(1, m*full$LogP[i] + c*full$Met_Harris[i] + b, width)
    initpi[i] =  10^((initM[i]- full$K0[i]+10.0)/5.0) # Initialize almost at random (m,b,d can take almost any value)
    
    
  }



 list(pi0=initpi,
      logP0=initlogp, met0=initmet, M=initM,
      angle=angle,
      b=b, beta=beta, gamma=gamma, c=c, width=width)
 
 #list(angle=angle, b=b, beta=beta, gamma=gamma, c=c, width=width)
 
 

} 

# generate a list of lists to specify initial values
#n_chains <- 4
n_chains <- 1
init_ll <- lapply(1:n_chains, function(id) initf2(chain_id = id))

nsamples_chain =5000 #LSB
#nsamples_chain =1000
nsamples <- n_chains*nsamples_chain/2

fit <- stan(model_code = stanPLZcode, data = linear_data, iter = nsamples_chain, 
            chains = n_chains,init=init_ll,control= list(adapt_delta = 0.8), verbose = TRUE,init_r=2)

#fit <- stan(model_code = stanPLZcode, data = linear_data, iter = nsamples_chain, 
#            chains = n_chains,init="random",control= list(adapt_delta = 0.8), verbose = TRUE,init_r=2)

#draws <- extract(fit)

```


Finally, we plot the marginal posterior distributions from the MCMC samples. The diagonal shows the one-dimensional marginal distributions of the intrinsic width of the relation, the angle of the metallicity term, the intercept and the angle of the log(P) term.



```{r}
pairs(fit, pars = c("width", "c", "b","angle"), labels=c("Intrinsic Width", "Metallicity Angle", "Intercept","log(P) Angle"), 
      log = FALSE, las = 1)
```
